Some housekeeping: In case you don’t have the necessary packages installed, run this script to do so.
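A minimal sketch of such a script. The package list is an assumption, assembled from the packages that appear later in this tutorial; adjust it to your needs.

```r
# Install any packages used in this tutorial that are not yet present.
# Note: this package list is assumed from the code below.
pkgs <- c("tidyverse", "GGally", "VIM", "mice",
          "FactoMineR", "factoextra", "reshape2")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```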
As with any concept, the definition of machine learning varies slightly depending on whom you ask. A little compilation of definitions by academics and practitioners alike:
Sometimes, the lines are blurry, given the vast and constantly expanding amount of algorithms and techniques.
Tasks related to pattern recognition and data exploration, in case there does not yet exist a right answer or problem structure. Main applications:
Dimensionality reduction techniques are foremost useful to (you might see it coming) reduce the dimensionality of our data. So, what does that mean? And why should we want to do that?
Dimensions here is a synonym for variables, so what we really want is fewer variables. To do that, we have to find ways to express the same amount of information with fewer, but more information-rich, variables. This is particularly useful to:
The type of analysis to be performed depends on the data set formats and structures. The most commonly used DR techniques are:
The mathematics underlying it are somewhat complex, so I won’t go into too much detail, but the basics of PCA are as follows: you take a dataset with many variables, and you simplify that dataset by turning your original variables into a smaller number of “Principal Components”.
But what are these exactly? Principal Components are the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out. This means that we try to find the straight line that best spreads the data out when it is projected along it. This is the first principal component, the straight line that shows the most substantial variance in the data.
Where many variables correlate with one another, they will all contribute strongly to the same principal component. Each principal component sums up a certain percentage of the total variation in the dataset. Where your initial variables are strongly correlated with one another, you will be able to approximate most of the complexity in your dataset with just a few principal components. Usually, the first principal component captures the main similarity in your data, the second the main difference.
These principal components can be computed via eigenvalues and eigenvectors. Just like many things in life, eigenvectors and eigenvalues come in pairs: every eigenvector has a corresponding eigenvalue. Simply put, an eigenvector is a direction, such as “vertical” or “45 degrees”, while an eigenvalue is a number telling you how much variance there is in the data in that direction. The eigenvector with the highest eigenvalue is, therefore, the first principal component. The number of eigenvalue/eigenvector pairs that exists is equal to the number of dimensions the dataset has. Consequently, we can reframe a dataset in terms of these eigenvectors and eigenvalues without changing the underlying information.
Note that reframing a dataset regarding a set of eigenvalues and eigenvectors does not entail changing the data itself, you’re just looking at it from a different angle, which should represent the data better.
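To make this concrete, a minimal base-R sketch: the eigenvectors of the correlation matrix of a dataset are exactly the principal component directions, and each eigenvalue is the variance along its eigenvector. The built-in mtcars data serves as a stand-in for illustration.

```r
# PCA "by hand": the eigenvectors of the correlation matrix are the
# principal component directions, the eigenvalues their variances.
# mtcars is used as a stand-in dataset for illustration.
e <- eigen(cor(mtcars))
e$values          # variance along each direction, in decreasing order
e$vectors[, 1]    # direction of the first principal component
# For standardized variables the eigenvalues sum to the number of
# variables, since each variable contributes a variance of 1:
sum(e$values)     # equals ncol(mtcars)
```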
Alright, let’s load some data. Here we will draw on some of our own work, where we explore the life of digital nomads. The paper is not yet written, but the preliminary work is summarized in this presentation. You probably already know the data from NomadList. Here, we look at the 2017 crawl of city data, which compiles the digital nomads’ ranking of cities along a couple of dimensions. Let’s take a look.
Roman’s web-crawled data is always a bit messy, so we do a little cosmetic cleaning upfront.
Let’s take a look:
Quite a set of interesting features, which are all numerically coded. Let’s select the ones we want to analyze and organize them a bit. Since there are a lot of variables, I afterwards select only a subset on which we do some graphical exploration.
Ok, time for some exploration. Here I will introduce the GGally package, a wrapper around ggplot2 with some functions for very nice visual summaries in matrix form.
Attaching package: 'GGally'
The following object is masked from 'package:dplyr':
nasa
First, let’s look at a classical correlation matrix.
Even cooler, the ggpairs function creates a scatterplot matrix plus all variable distributions and correlations for you. I previously used the PerformanceAnalytics package for that, but I like the ggplot style more.
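A minimal sketch of both plots; the built-in iris measurements stand in here, since the nomad data itself isn’t bundled with this text.

```r
library(GGally)
# Classical correlation matrix, ggplot2-styled.
ggcorr(iris[, 1:4], label = TRUE)
# Scatterplot matrix: densities on the diagonal, scatterplots in the
# lower triangle, correlation coefficients in the upper triangle.
ggpairs(iris[, 1:4])
```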
To remind you, component scores cannot be computed on missing features. So let’s impute them. It’s a good point to introduce you to some neat imputation techniques. First, the package VIM has some nice imputation functions, but also some nice diagnostic plots.
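A sketch of VIM’s missingness diagnostic; the built-in airquality data (which ships with R and contains NAs) stands in for the nomad data.

```r
library(VIM)
# aggr() plots the share of missings per variable (left panel) and the
# pattern of joint missingness across variables (right panel).
# airquality stands in for the nomad data here.
aggr(airquality, numbers = TRUE, sortVars = TRUE)
```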
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Since version 4.0.0 the GUI is in its own package VIMGUI.
Please use the package to use the new (and old) GUI.
Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
Variables sorted by number of missings:
For the real imputation, I prefer the mice package, which performs multivariate imputation by chained equations “under the hood”: every feature with missings is sequentially predicted from all other existing features in an iterative process. Since this process involves some stochastics, I define a seed upfront for reproducible results.
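A self-contained sketch of the mice workflow, again on airquality as a stand-in for the nomad data:

```r
library(mice)
set.seed(1337)  # imputation draws random values; fix the seed
# Chained equations: each incomplete variable is predicted from all
# others, cycling until the imputations stabilize.
imp <- mice(airquality, m = 1, maxit = 5, printFlag = FALSE)
data_imputed <- complete(imp)
sum(is.na(data_imputed))  # no missing values left
```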
iter imp variable
1 1 freedom_score peace_score fragile_states_index press_freedom_index
2 1 freedom_score peace_score fragile_states_index press_freedom_index
3 1 freedom_score peace_score fragile_states_index press_freedom_index
[... iterations 4-99 impute the same four variables ...]
100 1 freedom_score peace_score fragile_states_index press_freedom_index
Let’s look at the distribution of the imputed vs. the existing features.
I would say, good enough. Let’s take them!
To execute the PCA, we will use the FactoMineR package to compute the PCA, and factoextra for extracting and visualizing the results. FactoMineR is a great package and my favorite for computing principal component methods in R. It’s very easy to use and very well documented. There are alternatives around, but for quite some time now I have found it to be the most powerful and convenient one. factoextra is just a convenient ggplot2 wrapper that easily produces nice and informative diagnostic plots for a variety of DR and clustering techniques.
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Let’s do that. Notice the scale.unit = TRUE argument, which you should always use when your variables are on different scales. Afterwards, we take a look at the resulting list object.
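The call itself is short. The tutorial’s actual call, per the $call element in the output, is PCA(X = data[, vars], scale.unit = TRUE, graph = FALSE); since that data isn’t bundled here, this self-contained sketch uses mtcars as a stand-in.

```r
library(FactoMineR)
# scale.unit = TRUE standardizes every variable first, so variables on
# big scales (e.g. monthly costs in USD) do not dominate the components.
res_pca <- PCA(mtcars, scale.unit = TRUE, graph = FALSE)
str(res_pca, max.level = 2)  # eigenvalues, variable and individual results
```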
List of 5
$ eig : num [1:21, 1:3] 9.062 1.976 1.312 1.125 0.931 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:21] "comp 1" "comp 2" "comp 3" "comp 4" ...
.. ..$ : chr [1:3] "eigenvalue" "percentage of variance" "cumulative percentage of variance"
$ var :List of 4
..$ coord : num [1:21, 1:5] 0.656 0.421 0.595 0.747 0.747 ...
.. ..- attr(*, "dimnames")=List of 2
..$ cor : num [1:21, 1:5] 0.656 0.421 0.595 0.747 0.747 ...
.. ..- attr(*, "dimnames")=List of 2
..$ cos2 : num [1:21, 1:5] 0.431 0.177 0.354 0.559 0.559 ...
.. ..- attr(*, "dimnames")=List of 2
..$ contrib: num [1:21, 1:5] 4.75 1.95 3.9 6.16 6.16 ...
.. ..- attr(*, "dimnames")=List of 2
$ ind :List of 4
..$ coord : num [1:781, 1:5] 0.465 -2.067 2.273 1.936 4.243 ...
.. ..- attr(*, "dimnames")=List of 2
..$ cos2 : num [1:781, 1:5] 0.0125 0.308 0.2003 0.1424 0.3431 ...
.. ..- attr(*, "dimnames")=List of 2
..$ contrib: num [1:781, 1:5] 0.00306 0.06037 0.07297 0.05294 0.25431 ...
.. ..- attr(*, "dimnames")=List of 2
..$ dist : Named num [1:781] 4.16 3.72 5.08 5.13 7.24 ...
.. ..- attr(*, "names")= chr [1:781] "1" "2" "3" "4" ...
$ svd :List of 3
..$ vs: num [1:21] 3.01 1.406 1.145 1.061 0.965 ...
..$ U : num [1:781, 1:5] 0.154 -0.687 0.755 0.643 1.409 ...
..$ V : num [1:21, 1:5] 0.218 0.14 0.198 0.248 0.248 ...
$ call:List of 9
..$ row.w : num [1:781] 0.00128 0.00128 0.00128 0.00128 0.00128 ...
..$ col.w : num [1:21] 1 1 1 1 1 1 1 1 1 1 ...
..$ scale.unit: logi TRUE
..$ ncp : num 5
..$ centre : num [1:21] 2332.3 210.4 1880.4 3.3 3.3 ...
..$ ecart.type: num [1:21] 1116.82 173.87 1264.6 1.98 1.98 ...
..$ X :'data.frame': 781 obs. of 21 variables:
.. ..$ cost_nomad : int [1:781] 1364 777 1639 1545 3028 3238 2554 3503 3427 2245 ...
.. ..$ cost_coworking : num [1:781] 152.4 98.9 159.1 47 200 ...
.. ..$ cost_expat : int [1:781] 1273 780 1653 1640 3309 4325 2197 2691 3764 1859 ...
.. ..$ coffee_in_cafe : num [1:781] 1.73 0.85 1.99 1.88 5 4 5.38 5 5 4.03 ...
.. ..$ cost_beer : num [1:781] 1.73 0.85 1.99 1.88 5 4 5.38 5 5 4.03 ...
.. ..$ places_to_work : num [1:781] 1 0.8 1 1 1 1 1 1 1 0.8 ...
.. ..$ free_wifi_available : num [1:781] 0.4 0.6 0.6 1 0.6 1 0.6 0.4 1 0.24 ...
.. ..$ internet_speed : int [1:781] 31 14 15 16 118 81 18 23 55 24 ...
.. ..$ freedom_score : num [1:781] 0.6 0.2 0.8 0.6 0.6 0.6 0.8 0.6 0.6 0.8 ...
.. ..$ peace_score : num [1:781] 0.8 0.4 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 ...
.. ..$ safety : num [1:781] 0.6 0.8 0.8 1 0.73 0.73 0.8 0.8 0.6 0.8 ...
.. ..$ fragile_states_index : num [1:781] 52.7 78.8 40.8 35.1 34 34 39.8 34 34 39.8 ...
.. ..$ press_freedom_index : num [1:781] 28.2 44.5 16.7 24.4 22.5 ...
.. ..$ female_friendly : num [1:781] 1 0.8 1 1 0.8 0.8 0.8 0.8 0.8 0.8 ...
.. ..$ lgbt_friendly : num [1:781] 0.27 0.6 0.6 0.8 0.6 1 1 0.8 0.8 1 ...
.. ..$ friendly_to_foreigners: num [1:781] 0.6 0.6 0.8 0.8 0.8 0.8 0.8 1 1 0.8 ...
.. ..$ racism : num [1:781] 0.4 0.4 0.42 0 0.8 0.8 0.6 0.8 0.8 1 ...
.. ..$ leisure : num [1:781] 0.8 0.62 1 1 1 1 0.6 1 0.6 0.78 ...
.. ..$ life_score : num [1:781] 0.86 0.75 0.83 0.93 0.95 1 0.88 0.95 0.92 0.85 ...
.. ..$ nightlife : num [1:781] 1 0.4 1 0.6 1 1 0.8 1 1 0.8 ...
.. ..$ weed : int [1:781] 0 0 1 0 0 0 1 1 1 1 ...
..$ row.w.init: num [1:781] 1 1 1 1 1 1 1 1 1 1 ...
..$ call : language PCA(X = data[, vars], scale.unit = TRUE, graph = FALSE)
- attr(*, "class")= chr [1:2] "PCA" "list "
Ok, let’s look at the “screeplot”, a diagnostic visualization that displays the variance explained by every component. Here, and for all following visualizations with the fviz_ prefix, we use the factoextra package. Notice that the output in every case is a ggplot2 object, which could be complemented with further layers.
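A sketch of the screeplot call, again on the mtcars stand-in:

```r
library(FactoMineR); library(factoextra)
res_pca <- PCA(mtcars, scale.unit = TRUE, graph = FALSE)  # stand-in data
# Screeplot: percentage of variance per component. The result is a
# ggplot2 object, so further layers can simply be added with "+".
fviz_screeplot(res_pca, addlabels = TRUE)
```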
As expected, we see that the first component already captures a main share of the variance. Let’s look at the corresponding eigenvalues.
For feature selection, our rule of thumb is to only include components with an eigenvalue > 1, meaning that in this case we would have reduced our data to 4 dimensions. Let’s project them onto 2-dimensional space and take a look at the vectors of our features.
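Both steps sketched on the mtcars stand-in: inspect the eigenvalues, then map the variables onto the first two components.

```r
library(FactoMineR); library(factoextra)
res_pca <- PCA(mtcars, scale.unit = TRUE, graph = FALSE)  # stand-in data
get_eigenvalue(res_pca)   # keep components with eigenvalue > 1
# Variable loadings projected onto the first two components; arrows
# pointing in similar directions indicate correlated variables.
fviz_pca_var(res_pca, repel = TRUE)
```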
We see that they tend to cluster in 3 groups:
Let’s look at the numeric values.
Principal Component Analysis Results for variables
===================================================
Name Description
1 "$coord" "Coordinates for the variables"
2 "$cor" "Correlations between variables and dimensions"
3 "$cos2" "Cos2 for the variables"
4 "$contrib" "contributions of the variables"
The results object also contains the observations’ loadings on the components.
Principal Component Analysis Results for individuals
===================================================
Name Description
1 "$coord" "Coordinates for the individuals"
2 "$cos2" "Cos2 for the individuals"
3 "$contrib" "contributions of the individuals"
Let’s visualize our observations and the variable loadings together in the space of the first 2 components.
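That combined view is a biplot; a sketch on the mtcars stand-in:

```r
library(FactoMineR); library(factoextra)
res_pca <- PCA(mtcars, scale.unit = TRUE, graph = FALSE)  # stand-in data
# Observations as points, variable loadings as arrows, both in the
# plane of the first two principal components.
fviz_pca_biplot(res_pca, repel = TRUE)
```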
We can also briefly check if our dimensionality reduction is helpful to differentiate between nomad scores.
Just a sidenote (which might become more important in later lectures): the components delivered by a PCA can also be used to create distance or similarity measures between two observations. You might need to refresh a bit of your vector algebra for the different ways to create distance measures between vectors. Here, we will just use the simple “Euclidean” distance in n-dimensional space. This can be done with the base-R dist() function. However, factoextra has a function get_dist() which I prefer, since it includes a couple of other useful distance measures.
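A sketch of the distance computation; mtcars stands in for the component scores.

```r
library(factoextra)
# Euclidean distances between observations in standardized feature
# space; get_dist() also offers correlation-based measures such as
# method = "pearson".
d <- get_dist(scale(mtcars), method = "euclidean")
d_mat <- as.matrix(d)
d_mat[1:3, 1:3]  # pairwise distances, 0 on the diagonal
```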
The resulting distance object can be transformed into a distance matrix. We will also add names for the matrix dimensions.
This matrix we could, for example, use to create a distance network (as we will do in M2). When continuing with “tidy” data, we would like to transform it into what in network jargon is called an “edgelist”. So, that’s a classical use of the gather() function. However, since we have a matrix here, I will use the melt() function from the reshape2 package, since it automatically tidies the names of the matrix dimensions. Notice: this is a very convenient but not the most efficient way to create distance edgelists. In case you have a very large number of entities, you might want to learn how to deal with sparse matrices. More on that again in M2.
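A sketch of that melt step on a stand-in distance matrix:

```r
library(reshape2)
# From a named distance matrix to a tidy edgelist (from, to, distance).
d_mat <- as.matrix(dist(scale(mtcars)))
edgelist <- melt(d_mat, varnames = c("from", "to"),
                 value.name = "distance")
head(edgelist[edgelist$from != edgelist$to, ])  # drop self-distances
```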
Attaching package: 'reshape2'
The following objects are masked from 'package:data.table':
dcast, melt
The following object is masked from 'package:tidyr':
smiths
Ok, let’s just take a brief look at which cities are most similar, and most distant, in terms of their characteristics.
Sidenote: here, we created the distance based on all components equally. Instead, one could weight the distance by the components’ variance explained, so that the most explanatory component gets a higher weight. That would be a nice exercise.
Such distance edgelists can be extremely informative. However, we will for now not use them anymore in the analysis to come, so let’s get rid of the big objects.
Clustering can be broadly divided into two subgroups:
Clustering algorithms can also be categorized based on their cluster model, that is based on how they form clusters or groups. This tutorial only highlights some of the prominent clustering algorithms.
Connectivity-based clustering: the main idea behind this clustering is that data points that are closer in the data space are more related (similar) than data points farther away. The clusters are formed by connecting data points according to their distance. At different distances, different clusters will form, and they can be represented using a dendrogram, which gives away why these methods are also commonly called hierarchical clustering. They do not produce a unique partitioning of the dataset, but rather a hierarchy from which the user still needs to select appropriate clusters by choosing the level at which to cut. Note: they are also not very robust towards outliers, which might show up as additional clusters or even cause other clusters to merge.
Centroid-based clustering: in this type of clustering, clusters are represented by a central vector or centroid. This centroid might not necessarily be a member of the dataset. This is an iterative clustering algorithm in which the notion of similarity is derived from how close a data point is to the centroid of the cluster. k-means is a centroid-based clustering method, and you will see this topic in more detail later on in the tutorial.
Distribution-based clustering: this clustering is very closely related to statistics: distributional modeling. Clustering is based on the notion of how probable it is for a data point to belong to a certain distribution, such as the Gaussian distribution, for example. Data points in a cluster belong to the same distribution. These models have a strong theoretical foundation; however, they often suffer from overfitting. Gaussian mixture models, fitted via the expectation-maximization algorithm, are a famous distribution-based clustering method.
Density-based methods: these search the data space for areas of varied density of data points. Clusters are defined as areas of higher density within the data space compared to other regions. Data points in the sparse areas are usually considered to be noise and/or border points. The drawback of these methods is that they expect some kind of density guide or parameters to detect cluster borders. DBSCAN and OPTICS are some prominent density-based clustering methods.
So, which is the best to use? Hard to say. Clustering is a subjective task, and there can be more than one correct clustering algorithm. Every algorithm follows a different set of rules for defining the ‘similarity’ among data points. The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one clustering algorithm over another. An algorithm might work well on a particular dataset but fail for a different kind of dataset. Since there is most of the time no wrong or right, the clustering that delivers the most useful results is the way to go.
K-means clustering is the most commonly used unsupervised machine learning algorithm for dividing a given dataset into k clusters, where k must be provided by the user. The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized. There are several k-means algorithms available. However, the standard algorithm defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid. It’s an iterative process containing the following steps:
1. Choose k, the number of clusters to be created.
2. Select k objects from the dataset as the initial cluster centers.
3. Assign each observation to the cluster whose centroid is closest.
4. For each of the k clusters, recompute the cluster centroid by calculating the new mean value of all the data points in the cluster.
5. Repeat steps 3-4 until the assignments stop changing or the maximum number of iterations is reached (R uses 10 as the default value).
So, let’s do that. As already mentioned, we have to choose our k upfront. However, there exists some guidance, for example the highest gain in “total within sum of squares” (fast to calculate), the “silhouette”, as well as the “gap statistic” (hard to calculate, takes time).
Ok, we here settle for 3 (executive decision). Before we start, something weird upfront: the function takes the observation names from the rownames (which nobody uses anymore, and which are discouraged by dplyr). So, remember to define them right before you cluster, otherwise the next dplyr pipe will delete them again.
Ok, let’s run the algorithm.
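A self-contained sketch of the call, with mtcars standing in for the nomad features:

```r
set.seed(1337)                  # k-means starts from random centers
X <- scale(mtcars)              # stand-in for the nomad features
km <- kmeans(X, centers = 3, nstart = 25)  # 25 restarts, keep the best
str(km)                         # cluster vector, centers, within-SS, ...
```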
List of 9
$ cluster : Named int [1:781] 3 3 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr [1:781] "Budapest" "Chiang Mai" "Prague" "Taipei" ...
$ centers : num [1:3, 1:21] -0.671 0.69 -0.377 -0.415 0.417 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:21] "cost_nomad" "cost_coworking" "cost_expat" "coffee_in_cafe" ...
$ totss : num 16380
$ withinss : num [1:3] 2475 4794 2592
$ tot.withinss: num 9861
$ betweenss : num 6519
$ size : int [1:3] 237 341 203
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
Again, let’s visualize it. To have a meaningful 2d visualization, we again project the observations onto the space of the first 2 components.
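A sketch with fviz_cluster(), again on the mtcars stand-in:

```r
library(factoextra)
set.seed(1337)
km <- kmeans(scale(mtcars), centers = 3, nstart = 25)  # stand-in data
# fviz_cluster() projects the observations onto the first two
# principal components and colors them by cluster assignment.
fviz_cluster(km, data = scale(mtcars), repel = TRUE)
```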
Ok, we got 3 clusters. Let’s look what’s in them.
The key operation in hierarchical agglomerative clustering is to repeatedly combine the two nearest clusters into a larger cluster. There are three key questions that need to be answered first:
There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are:
The choice of linkage method entirely depends on you and there is no hard and fast method that will always give you good results. Different linkage methods lead to different clusters.
Some further practical issues:
However, let’s get started and perform a clustering. We here use the hcut function, which includes most of the abovementioned approaches as options.
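A sketch of hcut() plus the dendrogram plot, on the mtcars stand-in:

```r
library(factoextra)
# hcut() wraps the common hierarchical approaches; hc_method selects
# the linkage (e.g. "ward.D2", "complete", "average").
hc <- hcut(scale(mtcars), k = 3, hc_method = "ward.D2")  # stand-in data
fviz_dend(hc, rect = TRUE)  # dendrogram; y-axis = merge height
```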
In hierarchical clustering, you categorize the objects into a hierarchy similar to a tree-like diagram which is called a dendrogram. The distance of split or merge (called height) is shown on the y-axis of the dendrogram below.
Notice how the dendrogram is built and every data point finally merges into a single cluster with the height(distance) shown on the y-axis.
Let’s inspect what’s in the clusters.
And again visualize them:
Looks very similar, even though the middle cluster is a bit more squeezed in between now. We can also use our scatterplot diagnostics again, and color the observations by their cluster assignment.
You might already have wondered: "Could one combine a PCA with clustering techniques?" The answer is: "Yes!" In practice, this actually works quite well, and often delivers more robust clusters. So, let's give it a shot. We could do it by hand, but the HCPC function already does that for us, and also offers a nice diagnostic visualization.
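Doing it by hand is a useful way to see what `HCPC` (from FactoMineR) combines: run PCA first, keep the leading components, then cluster the component scores. A sketch on placeholder data; keeping 3 components and cutting into 3 clusters are assumptions for illustration.

```r
# Sketch: PCA-then-clustering by hand, the idea behind HCPC.
set.seed(5)
X   <- scale(matrix(rnorm(100 * 6), ncol = 6))  # placeholder data
pca <- prcomp(X)

scores   <- pca$x[, 1:3]                         # keep the first 3 components
hc       <- hclust(dist(scores), method = "ward.D2")
clusters <- cutree(hc, k = 3)

# Visualize the clusters on the first two components
plot(scores[, 1:2], col = clusters, pch = 19,
     xlab = "PC1", ylab = "PC2")
```

Clustering the scores instead of the raw variables drops the low-variance directions, which often act as noise, and that is why the resulting clusters tend to be more robust.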
To finish up, let's plot it on a map, in the simplest way possible.
You might have heard about the k-means clustering algorithm; if not, take a look at this tutorial. There are many fundamental differences between the two algorithms, although either one can perform better than the other in different cases. Some of the differences are:
Perhaps the most important part of any unsupervised learning task is the analysis of the results. After you have performed the clustering, with whatever algorithm and set of parameters, you need to make sure that you did it right. But how do you determine that?
Well, there are many measures for this; perhaps the most popular one is Dunn's index: the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. The diameter of a cluster is the distance between its two furthest points. To obtain well-separated and compact clusters, you should aim for a high Dunn's index.
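The definition translates directly into a few lines of base R. A hand-rolled sketch (packages such as clValid provide a ready-made `dunn()`; the helper below and its placeholder data are just for illustration):

```r
# Sketch of Dunn's index: minimum between-cluster distance divided by
# the maximum cluster diameter (higher is better).
dunn_index <- function(X, clusters) {
  d    <- as.matrix(dist(X))
  same <- outer(clusters, clusters, "==")  # TRUE if pair is in the same cluster
  diag(same) <- NA                         # ignore self-distances
  max_diameter   <- max(d[same & !is.na(same)])   # largest within-cluster distance
  min_separation <- min(d[!same & !is.na(same)])  # smallest between-cluster distance
  min_separation / max_diameter
}

set.seed(6)
X  <- scale(matrix(rnorm(90 * 4), ncol = 4))
km <- kmeans(X, centers = 3, nstart = 25)
dunn_index(X, km$cluster)
```

Computing the index for several candidate values of k (or for different linkage methods) and picking the configuration with the highest value is the typical usage.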
Furthermore, graphical inspection often helps when comparing the results of different algorithms and parameters. Here you find some advanced diagnostic visualizations for hierarchical clustering.
Lastly, a cluster's quality is to a large extent determined by its usefulness:
* Internal Validity
* External Validity
So, why not have some fun on your own now? Try to use what you have learned up to now in the following exercise. —> HERE <— you will find a dataset on Geert Hofstede's "6-D model of national culture". This measure of country-level culture in (by now) 6 dimensions became very popular in sociology, economics, and management science for explaining cross-cultural interactions as well as frictions. An exhaustive documentation of the 2013 dataset can be found here. It contains the following variables:
* pdi: The power distance index is defined as "the extent to which the less powerful members of organizations and institutions (like the family) accept and expect that power is distributed unequally." In this dimension, inequality and power are perceived from the followers, or the lower level. A higher degree of the index indicates that hierarchy is clearly established and executed in society, without doubt or reason. A lower degree of the index signifies that people question authority and attempt to distribute power.
* idv: This index explores the "degree to which people in a society are integrated into groups." Individualistic societies have loose ties that often only relate an individual to his/her immediate family. They emphasize the "I" versus the "we". Its counterpart, collectivism, describes a society in which tightly-integrated relationships tie extended families and others into in-groups. These in-groups are laced with undoubted loyalty and support each other when a conflict arises with another in-group.
* mas: In this dimension, masculinity is defined as "a preference in society for achievement, heroism, assertiveness and material rewards for success." Its counterpart represents "a preference for cooperation, modesty, caring for the weak and quality of life." Women in the respective societies tend to display different values. In feminine societies, they share modest and caring views equally with men. In more masculine societies, women are somewhat assertive and competitive, but notably less than men. In other words, they still recognize a gap between male and female values. This dimension is frequently viewed as taboo in highly masculine societies.
* uai: The uncertainty avoidance index is defined as "a society's tolerance for ambiguity," in which people embrace or avert an event of something unexpected, unknown, or away from the status quo. Societies that score a high degree in this index opt for stiff codes of behavior, guidelines, laws, and generally rely on absolute truth, or the belief that one lone truth dictates everything and people know what it is. A lower degree in this index shows more acceptance of differing thoughts or ideas. Society tends to impose fewer regulations, people are more accustomed to ambiguity, and the environment is more free-flowing.
* ltowvs: This dimension associates the connection of the past with current and future actions/challenges. A lower degree of this index (short-term) indicates that traditions are honored and kept, while steadfastness is valued. Societies with a high degree in this index (long-term) view adaptation and circumstantial, pragmatic problem-solving as a necessity. A poor country that is short-term oriented usually has little to no economic development, while long-term oriented countries continue to develop to a point.
* ivr: This dimension is essentially a measure of happiness; whether or not simple joys are fulfilled. Indulgence is defined as "a society that allows relatively free gratification of basic and natural human desires related to enjoying life and having fun." Its counterpart is defined as "a society that controls gratification of needs and regulates it by means of strict social norms." Indulgent societies believe themselves to be in control of their own life and emotions; restrained societies believe other factors dictate their life and emotions.

Ok, looks interesting. Let's do the following:
Have fun!